Kubernetes Pitfalls (2): Troubleshooting Intermittent TCP Connection Failures on Service IPs (LVS)

Problem stage 1:

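1. A user reported that a Redis instance felt sluggish. Clients reach this Redis through a Service, that is, through IPVS in NAT (masquerade) mode. ipvsadm -L showed that traffic arriving at the VIP on port 6379 was being forwarded round-robin (rr) to both port 80 and port 6379 of the pod. Nothing in the pod listens on port 80, so roughly half of all new connections were going nowhere. No wonder it was slow: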

# ipvsadm | grep -2 10.108.152.210
TCP  10.108.152.210:6379 rr
  -> 172.26.6.185:http            Masq    1      0          0         
  -> 172.26.6.185:6379            Masq    1      3          3

2. Check the Service:

apiVersion: v1
kind: Service
metadata:
  creationTimestamp: 2018-12-06T02:45:49Z
  labels:
    app: skuapi-rds-prod
    run: skuapi-rds-prod
  name: skuapi-rds-prod
  namespace: default
  resourceVersion: "41888273"
  selfLink: /api/v1/namespaces/default/services/skuapi-rds-prod
  uid: 09e1a61f-f901-11e8-878f-141877468256
spec:
  clusterIP: 10.108.152.210
  ports:
  - name: redis
    port: 6379
    protocol: TCP
    targetPort: 6379
  selector:
    app: skuapi-rds-prod
  sessionAffinity: None
  type: ClusterIP

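The Service definition itself looks fine. I then remembered that when I first created this Service I had mistakenly left port and targetPort at the default of 80, and only later changed them to 6379 with kubectl edit svc. It appears that after the edit was saved, the original targetPort 80 stayed behind in IPVS as one of the real servers (RIPs). This is most likely a bug.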

3. Fix:
Delete the stale RIP with ipvsadm, or delete the Service and recreate it.
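
The manual cleanup can be sketched as a small helper that only prints the delete commands, so they can be reviewed before anything is changed. This is a minimal sketch, not the procedure actually used at the time: the function name stale_rip_cmds is made up for illustration, and it assumes numeric output from ipvsadm -Ln (the listing above was taken without -n, which is why port 80 appears as "http").

```shell
#!/usr/bin/env bash
# Hypothetical helper (name chosen for this example).
# Reads `ipvsadm -Ln` output on stdin and prints, without running them,
# the commands that would delete every real server of the given TCP
# virtual service whose port differs from the expected target port.
#   $1 = virtual service as VIP:PORT, $2 = expected target port
stale_rip_cmds() {
  local vip="$1" want_port="$2"
  awk -v vip="$vip" -v want="$want_port" '
    # A line starting with a protocol name opens a new virtual service.
    $1 == "TCP" || $1 == "UDP" || $1 == "SCTP" {
      in_svc = ($1 == "TCP" && $2 == vip)
      next
    }
    # Real server lines look like: "  -> 172.26.6.185:80  Masq  1  0  0"
    in_svc && $1 == "->" {
      n = split($2, parts, ":")
      if (parts[n] != want) print "ipvsadm -d -t " vip " -r " $2
    }
  '
}
```

Typical use would be ipvsadm -Ln | stale_rip_cmds 10.108.152.210:6379 6379, then running the printed ipvsadm -d line by hand after checking it. Once the stale entry is gone, the listing for the VIP shows a single real server: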

# ipvsadm | grep -2 10.108.152.210
TCP  10.108.152.210:6379 rr
  -> 172.26.6.185:6379            Masq    1      4          0

Problem stage 2:

Symptom:

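Users reported that connections to Service VIPs often failed. This was not limited to Redis; several VIPs were affected. So I started investigating.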

Investigation:

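Checking IPVS and the system's connection state turned up two problems:
1. A large number of TCP connections in TIME_WAIT, which shows that most connections are short-lived. The IPVS tcpfin timeout was 2 minutes, which is too long and worth shortening. 30 seconds is a reasonable value, and for busy services it can go lower.
2. A fair number of dropped packets.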

[root@p020107 ~]# ipvsadm -lnc |wc -l
23776
[root@p020107 ~]# ipvsadm -lnc |grep TIME_WAIT | wc -l
10003
[root@p020107 ~]# ipvsadm -L --timeout 
Timeout (tcp tcpfin udp): 900 120 300

root@p020107:~# netstat -s | grep timestamp
    17855 packets rejects in established connections because of timestamp
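
Instead of grepping for one state at a time, the whole connection table can be summarized per state. The sketch below is an illustration only; the function name ipvs_state_counts is made up, and it assumes the usual ipvsadm -lnc layout in which the third column is the connection state.

```shell
#!/usr/bin/env bash
# Hypothetical helper (name chosen for this example).
# Reads `ipvsadm -lnc` output on stdin and prints "<count> <state>" lines,
# most frequent state first. Header lines are skipped because their first
# field is not a protocol name.
ipvs_state_counts() {
  awk '$1 == "TCP" || $1 == "UDP" { count[$3]++ }
       END { for (state in count) print count[state], state }' |
    sort -k1,1nr -k2
}
```

Usage: ipvsadm -lnc | ipvs_state_counts. A table dominated by TIME_WAIT points at short-lived connections, as described above.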

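Note:
In the TCP four-way close, the side that initiates the close (think of it as the client) passes through four states after sending its FIN: FIN_WAIT_1 -> FIN_WAIT_2 -> TIME_WAIT -> CLOSED. The kernel parameter net.ipv4.tcp_fin_timeout limits how long an orphaned connection may sit in FIN_WAIT_2 waiting for the peer's FIN, so shortening it lets such connections move on sooner. Once in TIME_WAIT, the socket waits for 2 MSL (Maximum Segment Lifetime) before reaching CLOSED, at which point the connection is closed and its resources are released. Note: MSL differs between platforms, typically somewhere between 30 seconds and 2 minutes, and is generally not tunable (Linux hard-codes the TIME_WAIT duration at 60 seconds in the kernel).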

So why wait for 2*MSL? I found a fairly readable explanation on Stack Overflow:

So the TIME_WAIT time is generally set to double the packets maximum age. This value is the maximum age your packets will be allowed to get to before the network discards them.
That guarantees that, before you’re allowed to create a connection with the same tuple, all the packets belonging to previous incarnations of that tuple will be dead.

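In other words:
TIME_WAIT is designed to last twice the maximum lifetime of a TCP segment. The network has latency, segments can be delayed or lost in transit, and the sender retransmits once it decides a segment is lost (for example after a timeout). If a socket were closed immediately, with no wait, and a new connection were then opened with the same address and port tuple, the new connection could receive data that belonged to the old, already closed one. Waiting 2 MSL guarantees that any segment from the old connection that has not yet reached the receiver has expired in the network, so the new socket will not see stale segments.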

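Optimization

Approach:
1. Move closing connections through to TIME_WAIT faster, so that ports become available again sooner
2. Fix the packet drops caused by timestamps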

Check the timeouts IPVS applies to each connection type, and change tcpfin from the default 2 minutes to 30 seconds:

[root@p020107 ~]# ipvsadm -L --timeout 
Timeout (tcp tcpfin udp): 900 120 300
[root@p020107 ~]# ipvsadm --set 900 30 300

[root@p020107 ~]# ipvsadm -L --timeout 
Timeout (tcp tcpfin udp): 900 30 300
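
ipvsadm --set takes all three timeouts at once, so changing only tcpfin means re-supplying the current tcp and udp values. The sketch below derives the command from the current output instead of hard-coding 900 and 300. It is an illustration only: the function name ipvs_set_tcpfin_cmd is made up, and it assumes the single-line "Timeout (tcp tcpfin udp): ..." format shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (name chosen for this example).
# Reads `ipvsadm -L --timeout` output on stdin and prints, without running
# it, the `ipvsadm --set` command that changes only the tcpfin timeout.
# Prints nothing if tcpfin already has the wanted value.
#   $1 = wanted tcpfin timeout in seconds
ipvs_set_tcpfin_cmd() {
  local want="$1"
  awk -v want="$want" '
    /^Timeout \(tcp tcpfin udp\):/ {
      tcp = $(NF-2); tcpfin = $(NF-1); udp = $NF
      if (tcpfin != want) print "ipvsadm --set " tcp " " want " " udp
    }
  '
}
```

Usage: ipvsadm -L --timeout | ipvs_set_tcpfin_cmd 30, then run the printed command after checking it.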

Kernel parameter tuning

Add the following to /etc/sysctl.conf:

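# Enable reuse: allow sockets in TIME_WAIT to be reused for new outgoing TCP connections.
# Default is 0 (off).
net.ipv4.tcp_tw_reuse = 1

# Enable fast recycling of TIME_WAIT sockets. Default is 0 (off).
# Note: net.ipv4.tcp_timestamps is on by default, and with tcp_tw_recycle also on, the
# per-source timestamp check described below becomes active. In some scenarios
# tcp_timestamps then has to be turned off; see the note on that option below.
# This option was removed in Linux 4.12, so drop this line on newer kernels.
net.ipv4.tcp_tw_recycle = 1

# Time in seconds a connection may wait in the FIN state; it is closed when the timeout expires.
net.ipv4.tcp_fin_timeout = 30

# Maximum number of sockets kept in TIME_WAIT (180000 here; the default varies with kernel
# version and memory size). Sockets beyond this limit are destroyed immediately.
net.ipv4.tcp_max_tw_buckets = 180000

# TCP timestamps (RFC 1323). The system caches the most recent timestamp seen from each
# source IP. If a later packet from that IP carries a smaller timestamp (it looks older),
# it is treated as invalid and dropped silently. Normally this is harmless, but under NAT
# different clients are translated to the same source IP before reaching the real server.
# The server then sees a single source IP whose timestamps do not necessarily increase,
# and packets get dropped. In NAT mode it is therefore recommended to turn this off.
# Note that tcp_tw_reuse relies on timestamps and has no effect once they are disabled.
net.ipv4.tcp_timestamps = 0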
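
Not every key exists on every kernel (net.ipv4.tcp_tw_recycle is gone since Linux 4.12), and sysctl -p reports an error for keys it cannot find. The sketch below filters a settings file down to the keys the running kernel actually exposes. It is an illustration only: the function name filter_supported_sysctls and the optional root argument are made up, and the simple dot-to-slash mapping does not handle interface names that themselves contain dots.

```shell
#!/usr/bin/env bash
# Hypothetical helper (name chosen for this example).
# Reads sysctl.conf-style lines on stdin and prints only the "key = value"
# lines whose key exists under the sysctl root (default /proc/sys).
# Comments and blank lines are dropped; skipped keys are reported on stderr.
#   $1 = optional sysctl root directory (useful for testing)
filter_supported_sysctls() {
  local root="${1:-/proc/sys}" line trimmed key
  while IFS= read -r line; do
    # Strip leading whitespace before classifying the line.
    trimmed="${line#"${line%%[![:space:]]*}"}"
    case "$trimmed" in
      ''|\#*|\;*) continue ;;
    esac
    [[ "$trimmed" == *=* ]] || continue
    key="${trimmed%%=*}"
    key="${key//[[:space:]]/}"
    # net.ipv4.tcp_fin_timeout -> <root>/net/ipv4/tcp_fin_timeout
    if [ -e "$root/${key//.//}" ]; then
      printf '%s\n' "$trimmed"
    else
      printf 'skipping unsupported key: %s\n' "$key" >&2
    fi
  done
}
```

One way to use it, assuming a draft file named tuning.conf (a made-up name): filter_supported_sysctls < tuning.conf, then append the printed lines to /etc/sysctl.conf.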

These parameters also interact with one another. The following article covers them in great detail:
http://www.freeoa.net/osuport/cluster/lvs-inuse-problem-sets_3111.html

Problem stage 3:

Symptom

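After the changes above, the number of TIME_WAIT connections dropped to four digits and the drop counter stopped increasing. A few days later, however, individual VIPs again occasionally failed to accept connections. I picked one of them, VIP 10.111.99.131, to investigate: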
client: 192.168.58.36
VIP: 10.111.99.131
RIP: 172.26.8.17

Investigation

Layer 3 connectivity is fine:

ywq@ywq:~$ traceroute 10.111.99.131
traceroute to 10.111.99.131 (10.111.99.131), 30 hops max, 60 byte packets
 1  192.168.58.254 (192.168.58.254)  7.952 ms  8.510 ms  9.131 ms
 2  10.111.99.131 (10.111.99.131)  0.253 ms  0.243 ms  0.226 ms

ywq@ywq:~$ ping 10.111.99.131
PING 10.111.99.131 (10.111.99.131) 56(84) bytes of data.
64 bytes from 10.111.99.131: icmp_seq=1 ttl=63 time=0.296 ms
64 bytes from 10.111.99.131: icmp_seq=2 ttl=63 time=0.318 ms
^C
--- 10.111.99.131 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1020ms
rtt min/avg/max/mdev = 0.296/0.307/0.318/0.011 ms

But a TCP connection cannot be established:

ywq@ywq:~$ telnet 10.111.99.131 80
Trying 10.111.99.131...
telnet: Unable to connect to remote host: Connection timeout

Check the connection state on the LVS director:

[root@p020107 ~]# ipvsadm -lnc | grep 58.36
TCP 00:59  TIME_WAIT   192.168.9.13:58236 10.97.85.43:80     172.26.5.209:80
TCP 00:48  SYN_RECV    192.168.58.36:57964 10.111.99.131:80   172.26.8.17:80
TCP 00:48  SYN_RECV    192.168.58.36:57963 10.111.99.131:80   172.26.8.17:80
TCP 00:48  SYN_RECV    192.168.58.36:57965 10.111.99.131:80   172.26.8.17:80

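The connections are stuck in SYN_RECV. In NAT mode the director does not answer the handshake itself: it forwards the client's SYN to the real server and records the connection as SYN_RECV until the handshake completes. Entries that stay in SYN_RECV therefore mean the director received the client's SYN, but the handshake never completed through it. Since the client can clearly reach the director, the problem most likely lies here:
packets between the director and the real server are not being exchanged properly, or are being dropped. After reading a lot of material, the rp_filter kernel parameter emerged as a likely cause.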

Here is the official kernel documentation for this parameter:

rp_filter - INTEGER
	0 - No source validation.
	1 - Strict mode as defined in RFC3704 Strict Reverse Path
	    Each incoming packet is tested against the FIB and if the interface
	    is not the best reverse path the packet check will fail.
	    By default failed packets are discarded.
	2 - Loose mode as defined in RFC3704 Loose Reverse Path
	    Each incoming packet's source address is also tested against the FIB
	    and if the source address is not reachable via any interface
	    the packet check will fail.

	Current recommended practice in RFC3704 is to enable strict mode
	to prevent IP spoofing from DDos attacks. If using asymmetric routing
	or other complicated routing, then loose mode is recommended.

	The max value from conf/{all,interface}/rp_filter is used
	when doing source validation on the {interface}.

	Default value is 0. Note that some distributions enable it
	in startup scripts.

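In short:
0: no source validation.
1: strict mode. The source address of each incoming packet is looked up in the FIB (Forwarding Information Base, roughly the routing table). The packet is accepted only if the interface it arrived on is also the best route back to that source; otherwise it is dropped.
2: loose mode. The source address is looked up in the FIB, and the packet is dropped only if the source is not reachable through any interface.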

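Applied to this setup:
With LVS (NAT) plus Kubernetes, a packet sent from the LVS director to a real server may travel over a tunnel interface. The real server receives it on the tunnel interface, but its routing table says the reply should leave through a physical interface such as eth or bond. If rp_filter is in strict mode, that mismatch causes packets to be dropped and the network to misbehave.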

Checking the interface settings on every kube node showed that the CentOS 7.4 nodes did have rp_filter enabled by default, while most of the Ubuntu nodes did not:

# veth interfaces belonging to containers can be ignored, since a container has only one outward-facing interface
[root@p020114 ~]# sysctl -a | grep rp_filter | grep -v 'veth'
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.bond0.rp_filter = 1
net.ipv4.conf.default.rp_filter = 0
net.ipv4.conf.docker0.rp_filter = 2
net.ipv4.conf.dummy0.rp_filter = 0
net.ipv4.conf.em1.rp_filter = 1
net.ipv4.conf.em2.rp_filter = 1
net.ipv4.conf.em3.rp_filter = 1
net.ipv4.conf.em4.rp_filter = 1
net.ipv4.conf.kube-bridge.rp_filter = 0
net.ipv4.conf.kube-dummy-if.rp_filter = 0
net.ipv4.conf.lo.rp_filter = 0
net.ipv4.conf.tun-192168926.rp_filter = 1
net.ipv4.conf.tun-192168927.rp_filter = 1
net.ipv4.conf.tun-192168928.rp_filter = 1
net.ipv4.conf.tun-192168929.rp_filter = 1
net.ipv4.conf.tunl0.rp_filter = 0
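
Because the kernel uses the maximum of conf/all and conf/{interface}, an interface set to 0 is still filtered strictly while all is 1. The sketch below computes the effective value per interface from sysctl output to make that visible. It is an illustration only: the function name effective_rp_filter is made up, and it assumes the "key = value" format shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (name chosen for this example).
# Reads `sysctl -a` output on stdin and prints "<interface> <effective value>",
# where the effective value is max(conf.all.rp_filter, conf.<interface>.rp_filter).
# The "all" and "default" pseudo-interfaces are not listed.
effective_rp_filter() {
  awk -F' *= *' '
    /^net\.ipv4\.conf\..*\.rp_filter/ {
      name = $1
      sub(/^net\.ipv4\.conf\./, "", name)
      sub(/\.rp_filter$/, "", name)
      val[name] = $2 + 0
    }
    END {
      all = ("all" in val) ? val["all"] : 0
      for (n in val) {
        if (n == "all" || n == "default") continue
        eff = (val[n] > all) ? val[n] : all
        print n, eff
      }
    }
  ' | sort
}
```

Usage: sysctl -a 2>/dev/null | grep rp_filter | effective_rp_filter. With the values listed above, tunl0 and kube-bridge come out as 1 even though their own setting is 0, because all is 1.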

Disable it:

echo "
net.ipv4.conf.all.rp_filter = 0
net.ipv4.conf.bond0.rp_filter = 0
net.ipv4.conf.default.rp_filter = 0
net.ipv4.conf.docker0.rp_filter = 2
net.ipv4.conf.dummy0.rp_filter = 0
net.ipv4.conf.em1.rp_filter = 0
net.ipv4.conf.em2.rp_filter = 0
net.ipv4.conf.em3.rp_filter = 0
net.ipv4.conf.em4.rp_filter = 0
net.ipv4.conf.kube-bridge.rp_filter = 0
net.ipv4.conf.kube-dummy-if.rp_filter = 0
net.ipv4.conf.lo.rp_filter = 0
net.ipv4.conf.tun-192168926.rp_filter = 0
net.ipv4.conf.tun-192168927.rp_filter = 0
net.ipv4.conf.tun-192168928.rp_filter = 0
net.ipv4.conf.tun-192168929.rp_filter = 0
net.ipv4.conf.tunl0.rp_filter = 0
" >> /etc/sysctl.conf

Load the new settings:

sysctl -p
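
A hard-coded interface list goes stale as tunnel interfaces come and go. As an alternative, the override lines can be generated from the interfaces present on the node. This is a sketch under stated assumptions, not what was done here: the function name rp_filter_overrides and the optional root argument are made up, only interfaces currently in strict mode (value 1) are touched, veth interfaces are skipped as above, and interface names containing dots are not handled.

```shell
#!/usr/bin/env bash
# Hypothetical helper (name chosen for this example).
# Prints a "net.ipv4.conf.<if>.rp_filter = 0" line for every interface
# (including "all") whose rp_filter is currently 1. veth* is skipped.
#   $1 = optional sysctl root directory (default /proc/sys)
rp_filter_overrides() {
  local root="${1:-/proc/sys}" f ifname
  for f in "$root"/net/ipv4/conf/*/rp_filter; do
    [ -r "$f" ] || continue
    ifname="${f%/rp_filter}"   # drop the trailing file name
    ifname="${ifname##*/}"     # keep only the interface directory name
    case "$ifname" in
      veth*) continue ;;
    esac
    if [ "$(cat "$f")" = "1" ]; then
      printf 'net.ipv4.conf.%s.rp_filter = 0\n' "$ifname"
    fi
  done
}
```

The output can be reviewed and then appended to /etc/sysctl.conf before running sysctl -p.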

Summary

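The problem of VIPs occasionally failing to establish TCP connections is resolved: a week has passed with no recurrence. I will keep monitoring.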